
Conducting sparse feature selection on arbitrarily long phrases in text corpora with a focus on interpretability



Abstract

We propose a general framework for topic-specific summarization of large text corpora, and illustrate how it can be used for analysis in two quite different contexts: an OSHA database of fatality and catastrophe reports (to facilitate surveillance for patterns in circumstances leading to injury or death) and legal decisions on workers' compensation claims (to explore relevant case law). Our summarization framework, built on sparse classification methods, is a compromise between the simple word-frequency-based methods currently in wide use and more heavyweight, model-intensive methods such as Latent Dirichlet Allocation (LDA). For a particular topic of interest (e.g., mental health disability, or chemical reactions), we regress a labeling of documents onto the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of phrases found to be predictive is then harvested as the summary. Using a branch-and-bound approach, this method can be extended to allow for phrases of arbitrary length, which allows for potentially rich summarization. We discuss how focus on the purpose of the summaries can inform choices of regularization parameters and model constraints. We evaluate this tool by comparing computational time and summary statistics of the resulting word lists to three other methods in the literature. We also present a new R package, textreg. Overall, we argue that sparse methods have much to offer text analysis, and that this is a branch of research that should be considered further in this context.
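The core idea in the abstract is to regress a topic labeling of documents onto high-dimensional phrase counts with a sparsity-inducing penalty, then keep the few phrases with nonzero coefficients as the summary. Below is a minimal sketch of that idea using scikit-learn, offered only as a generic analogue: it is not the authors' textreg package, and the example documents, labels, ngram_range, and regularization strength C are illustrative assumptions.

# Minimal sketch: sparse phrase selection for topic-specific summarization.
# Generic analogue of the approach described in the abstract, not the
# authors' textreg implementation; data and parameters are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

documents = [
    "worker fell from scaffold and suffered fatal head injury",
    "employee exposed to chemical reaction that released toxic fumes",
    "routine inspection of warehouse shelving completed",
    "claim for mental health disability following workplace harassment",
]
# 1 = document labeled as belonging to the topic of interest, 0 = background
labels = [1, 1, 0, 1]

# Count all words and short phrases (n-grams up to length 3 here; the paper's
# branch-and-bound search instead allows phrases of arbitrary length).
vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
X = vectorizer.fit_transform(documents)

# An L1 (lasso-type) penalty drives most phrase coefficients to exactly zero,
# so only a small, interpretable set of phrases is selected.
model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")
model.fit(X, labels)

# Harvest the phrases with nonzero coefficients as the summary.
phrases = vectorizer.get_feature_names_out()
summary = [p for p, w in zip(phrases, model.coef_[0]) if w != 0]
print(summary)

Note that a fixed ngram_range caps phrase length in advance, which is exactly the limitation the paper's branch-and-bound extension removes; the regularization strength plays the role of the tuning choices the abstract says should be driven by the purpose of the summary.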
